Abstract. The goal of the paper is to relate complexity measures associated
with the evaluation of Boolean functions (certificate complexity, decision
tree complexity) and learning dimensions used to characterize exact learning
(teaching dimension, extended teaching dimension). The high level motivation
is to discover non-trivial relations between exact learning of an unknown
concept and testing whether an unknown concept is part of a concept class or
not. Concretely, the goal is to provide lower and upper bounds of complexity
measures for one problem type in terms of the other.



1 Introduction 

The problem of learning a function from a concept class should be closely
related to the problem of deciding whether the function is in the class or
not. Imagine that one searches for an element in an ordered set: in the worst
case, the time it takes to find the element scales with the logarithm of the
size of the set. The same is true (in the worst case) for testing whether the
element is in the set at all. Thus, in exact learning, a hypothesis space can
be viewed as playing the role of a (partially) ordered set and a target
function as playing the role of the element that is searched for or tested for
membership.

Our main goal is to discuss relations between learning and computing Boolean
functions in a setting where a friendly 'teacher' provides the shortest proofs
to exactly identify a function in a class or to evaluate¹ it. On the learning
side this protocol is known as exact learning with a teacher [6] while on the
computational side it is known as the non-deterministic decision tree model
[5]. We will focus both on the worst case versions of the complexity measures
for exact learning and computation of Boolean functions and on their average
case counterparts ([10], [11]), as it has been observed that the worst case
complexity measures are sometimes unreasonably large even for simple concept
classes.

A natural way to interpret the non-deterministic decision tree model and the
protocol for learning with a teacher is as best case (but non-trivial)
scenarios for evaluation and learning. We are interested in these protocols
because any hardness result in such settings establishes a natural limit for
any other evaluation or exact learning protocol. That being said, we will also
investigate the aforementioned relations in a setting where the agent has the
power to make queries. In this context, we will briefly study the relations
between the decision tree complexity of Boolean functions (on the evaluation
side) and the query complexity (usually measured using a combinatorial measure
called extended teaching dimension [7]) of exact learning with membership
queries [1].

¹ We will use the terms 'evaluate' and 'compute' interchangeably in the paper.

Motivation. Our motivation is two-fold. From a purely theoretical perspective,
we think it is interesting to formally study relations between combinatorial
measures like teaching dimension and certificate / decision tree complexity,
as such relations could provide useful tools for proving lower and upper
bounds in learning theory.

From a more applied perspective, the motivation is very similar to the one
connecting learning and property testing (viewed as a relaxation of the
learning problem [E]). The intuition is that evaluating whether a particular
concept is part of a concept class or is 'far' from being in the concept class
should be an easier problem than learning the concept accurately. Thus, if
multiple hypothesis classes are candidates for being parent classes for the
target concept (as in agnostic learning), it might be worth running a testing
algorithm before actually learning the concept, to determine which function
class to use as a hypothesis space or, alternatively, which function classes
to eliminate from consideration.

Related Work. On the learning side, since the introduction of the teaching
protocol and its associated notion of teaching dimension [6], there have been
several papers that described bounds for this complexity measure for various
concept classes: (monotone) monomials, (monotone) DNFs, geometrical concepts,
juntas, linear threshold functions ([2], [H]). One of the early observations
was that sometimes, even for simple concept classes, the worst case teaching
dimension was trivially large, which contradicts the intuition that 'teaching'
should be relatively 'easy' for naturally occurring concept classes. There are
several relatively recent attempts ([3], [H]) to change the model so as to
better capture this intuition by allowing the learner or the teacher to assume
more about each other.

Another perspective on better capturing the overall difficulty of learning a
concept class in the teaching model is to consider the average case version of
the teaching dimension. The general case for a function class of size $m$ was
solved by [10], who proved that $O(\sqrt{m})$ samples are enough to learn any
function class, while there exists a function class for which
$\Omega(\sqrt{m})$ samples are necessary. For particular concept classes,
somewhat surprisingly, the average case bounds are actually much smaller: some
hypothesis classes (DNFs [11], LTFs [2]) have bounds on the average teaching
dimension that scale with $O(\log(m))$ while others (juntas [11]) are even
independent of $m$.

One intuitive reason for these gaps is that the general case upper bound is
actually uninformative for large concept classes (when $m$ is large, i.e.
$m > |X|^2$, a better upper bound is the trivial $|X|$ that shows the learner
all instances), whereas the proofs for particular concept classes actually
take advantage of the specific structure of a class to derive meaningful upper
bounds.

On the computational side, the (non-)deterministic decision tree model is
relatively well understood (see [5] for an excellent survey on the topic).
Complexity measures like certificate complexity, sensitivity, block
sensitivity and decision tree complexity are used to quantify the difficulty
of evaluating a Boolean function when access to the inputs is provided either
by an 'all-knowing teacher' (the non-deterministic decision tree model) or via
a query oracle (the deterministic decision tree model). While most of the
results deal with the worst case versions of the aforementioned complexity
measures, bounds for some of their average case versions have appeared in the
literature. Among them, we mention the results from [4], which address the
problem of the gap between average block sensitivity and average sensitivity
of a Boolean function (a well known open problem for the worst case versions
of the complexity measures).

Contributions. The first result (Section 3) is that the teaching dimension and
the certificate that a function is part of a hypothesis class (i.e.
1-certificate complexity) play a dual role: when a class is 'easy' to teach it
is 'hard' to certify its membership and vice versa. The second contribution
(Section 4) is to give lower bounds for the general case of the average
non-membership certificate size. The results have several applications to
learning and computing Boolean functions. Finally, we will describe structural
properties of Boolean functions that point to connections between learning and
computation in a setting that relates the (easier) teaching model with the
(harder) query model (Section 5).

2 Setting and Notation 

Let $\mathcal{F} = \{f_i\}_{i \in [m]}$ with $f_i : X \to \{0,1\}$ be a class
of $m$ Boolean functions and let
$\complement\mathcal{F} = \{f : X \to \{0,1\} \mid f \notin \mathcal{F}\}$ be
its complement (we denote $[m] = \{1, \ldots, m\}$). $\mathcal{F}$ itself can
be seen as a Boolean function, $\mathcal{F} : 2^X \to \{0,1\}$,
$\mathcal{F}(f) = 1$ iff $f \in \mathcal{F}$. We will usually consider $X$ to
be $\{0,1\}^n$ and we will label elements $x \in X$ as instances or examples
and $x_i$, $i \in [n]$, as the $n$ Boolean variables that describe $x$. In
what follows, it is assumed that both the nature and the agent know
$\mathcal{F}$, with nature choosing $f_t \in \{0,1\}^{2^n}$ in an adversarial
manner while the agent is not aware of the identity of $f_t$. We will first
describe the learning problem, then the computation problem and then discuss
how they are related.

Learning with a Teacher. On the learning side, we will focus on exact learning
with a teacher in the loop. In this protocol, nature chooses
$f_t \in \mathcal{F}$ and the learner knows $f_t$ is in $\mathcal{F}$ but is
not aware of its identity. The learner receives samples (pairs
$(x, label(x))$ with $x \in X$) from a 'teacher', without knowing whether the
teacher is well-intentioned or not. The goal of the learner is to uniquely
identify the hidden function $f_t$ using as few samples as possible. The
teacher is an optimal algorithm, aware of the identity of $f_t$, that gives
the learner the most informative set of instances so that the learner uniquely
identifies the target concept as fast as possible. The teacher is not allowed
to make any assumptions about the learning algorithm, other than assuming it
is consistent (i.e. that it maintains a hypothesis space consistent with the
set of revealed samples).

In this protocol, learning stops when the consistent hypothesis space of the
learner has size 1 and thus only contains the target hypothesis. For the
purpose of this paper, the learner and the teacher are assumed to have
unbounded computational power to compute updates to the hypothesis space and
optimal sample sets (computational issues are treated in [15] and [6]). One
intuitive perspective for this learning protocol is that it is the best case
scenario of exact learning with membership queries [1], where the learner
always guesses the best possible queries to find the target hypothesis. In the
model of exact learning with membership queries, the teacher is removed from
the protocol, and the learner is responsible for deciding which inputs to
query for labels, with the same goal of minimizing the number of samples until
the target concept is discovered.
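
The stopping rule of the teaching protocol can be illustrated with a small sketch. It assumes, purely for illustration, that every function is stored as a tuple holding its truth table over an instance space indexed 0..|X|-1; the helper names `consistent` and `teach` and the example class are hypothetical and not part of the paper.

```python
def consistent(hypotheses, samples):
    """Return the hypotheses that agree with every revealed (instance, label) pair."""
    return [h for h in hypotheses if all(h[x] == label for x, label in samples)]


def teach(hypotheses, target, instances):
    """Reveal the labels of `instances` one by one and record the version-space size."""
    samples, sizes = [], []
    for x in instances:
        samples.append((x, target[x]))
        sizes.append(len(consistent(hypotheses, samples)))
    return sizes


# Hypothetical class over |X| = 4: the four singletons plus the all-0 function.
singletons = [tuple(int(i == j) for j in range(4)) for i in range(4)]
all_zero = (0, 0, 0, 0)
hypotheses = singletons + [all_zero]

# One well-chosen sample is enough when the target is a singleton ...
easy = teach(hypotheses, singletons[2], [2])
# ... whereas the all-0 target requires every instance to be revealed.
hard = teach(hypotheses, all_zero, [0, 1, 2, 3])
```

Learning stops as soon as the recorded size reaches 1, which happens after one sample in the first run and only after the last sample in the second.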

We will now define a complexity measure (the teaching dimension) for learning
a fixed function $f \in \mathcal{F}$ in the protocol of learning with a
teacher.

Definition 1. For a fixed $f \in \mathcal{F}$, a minimum size teaching set
$TS(f)$ is a set of samples that uniquely identifies $f$ among all functions
in $\mathcal{F}$ with a size that is minimal among all possible teaching sets
for $f$. The teaching dimension of $f$ (with respect to $\mathcal{F}$) is
$TD_{\mathcal{F}}(f) = |TS(f)|$. The teaching dimension of $\mathcal{F}$ is
$TD(\mathcal{F}) = \max_{f \in \mathcal{F}} TD_{\mathcal{F}}(f)$.

Intuitively, a teaching set is a shortest 'proof' that certifies the identity
of the initially hidden target concept. The teaching dimension is simply the
maximum size of such a 'proof' over the entire hypothesis space.

To capture the difficulty of learning a hypothesis class as a whole we will
define the average $TD(\mathcal{F})$, which has some interesting combinatorial
properties ([10]).

Definition 2. The average teaching dimension of $\mathcal{F}$ is
$aTD(\mathcal{F}) = \frac{\sum_{f \in \mathcal{F}} TD_{\mathcal{F}}(f)}{|\mathcal{F}|}$.
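
Definitions 1 and 2 can be evaluated by brute force on very small classes. The following is a minimal sketch under the assumption that functions are given as truth-table tuples; the function names and the example class are illustrative choices, and the search is exponential in |X|.

```python
from itertools import combinations


def teaching_dimension(f, hypotheses):
    """Smallest number of instances on which f differs from every other hypothesis."""
    others = [g for g in hypotheses if g != f]
    for size in range(len(f) + 1):
        for subset in combinations(range(len(f)), size):
            # `subset` is a teaching set if each competitor disagrees with f somewhere on it.
            if all(any(g[x] != f[x] for x in subset) for g in others):
                return size
    # Not reached: two different truth tables always differ on the full instance space.
    raise AssertionError("unreachable")


def worst_case_td(hypotheses):
    return max(teaching_dimension(f, hypotheses) for f in hypotheses)


def average_td(hypotheses):
    return sum(teaching_dimension(f, hypotheses) for f in hypotheses) / len(hypotheses)


# Hypothetical example over |X| = 3: the three singletons plus the all-0 function.
example = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (0, 0, 0)]
```

For this example the all-0 function needs all three instances while each singleton needs one, so the worst case value is 3 and the average is (1 + 1 + 1 + 3) / 4 = 1.5.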

Computation in the Decision Tree Model. On the evaluation side, we will focus
on 'proofs' that certify what is the value of a Boolean function
$f : X \to \{0,1\}$ on an unknown input $x \in X$ (with $X$ usually
$\{0,1\}^n$). We will thus focus on certificate complexity, which quantifies
the difficulty of computing a Boolean function in the non-deterministic
decision tree computation model.

Let's assume $f$ is fixed and known to both the nature and the agent. The
protocol of interaction is as follows: nature chooses an input $x \in X$
($f(x) = b$ with $b = 0$ or $b = 1$) without revealing it to the agent, and
offers query access to the bits that define $x$. For any query $i$, it reveals
the correct bit value $x_i$ of the previously unknown bit $i$ in $x$. Now we
can define certificate complexity for a fixed input and for a function:

Definition 3. For a fixed function $f$ and a fixed and unknown input $x$ with
$f(x) = b$, a minimal $b$-certificate of $f$ on $x$ is a minimal size query
set that fixes the value of $f$ on $x$ to $b$. The $b$-certificate complexity
$C_f(x)$ is the size of such a minimal query set.

Definition 4. The 1-certificate complexity of a Boolean function $f$ is
$C^1(f) = \max_{x \in X^1} C_f(x)$, where $X^1 = \{x \in X \mid f(x) = 1\}$.
Symmetrically, $C^0(f) = \max_{x \in X^0} C_f(x)$. And then
$C(f) = \max(C^1(f), C^0(f))$.

An intuitive way to interpret certificate complexity is that it quantifies the
minimal number of examples (pairs ($x_i$, value of $x$ on $x_i$)) a friendly
'teacher' (that knows $x$) must reveal to certify to an agent what the value
of $f$ on $x$ is.

While there is no previous definition for the notion of average certificate
complexity in the literature, it is natural to define it in a similar manner
(and for similar reasons) as for the average teaching dimension:

Definition 5. The average 1-certificate complexity of a Boolean function $f$
is $aC^1(f) = \frac{\sum_{x \in X^1} C_f(x)}{|X^1|}$. We can symmetrically
define $aC^0(f)$ and $aC(f)$.
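
As a sanity check on Definitions 3 to 5, certificate sizes can be computed exhaustively for functions on a handful of bits. This is a sketch only: it assumes the function is given as a Python callable on bit tuples, all names are illustrative, and `max` would fail if no input takes the requested value `b`.

```python
from itertools import combinations, product


def certificate_size(f, x):
    """Size of a smallest set of bit positions of x that forces the value f(x)."""
    n = len(x)
    for size in range(n + 1):
        for positions in combinations(range(n), size):
            # Every input agreeing with x on `positions` must get the same value.
            if all(f(y) == f(x)
                   for y in product((0, 1), repeat=n)
                   if all(y[i] == x[i] for i in positions)):
                return size


def certificate_complexity(f, n, b):
    """C^b(f): the worst case over the inputs x with f(x) = b."""
    return max(certificate_size(f, x) for x in product((0, 1), repeat=n) if f(x) == b)


def average_certificate_complexity(f, n, b):
    """aC^b(f): the average over the inputs x with f(x) = b."""
    sizes = [certificate_size(f, x) for x in product((0, 1), repeat=n) if f(x) == b]
    return sum(sizes) / len(sizes)


def or3(x):
    """Illustrative function: the OR of three bits."""
    return int(any(x))
```

For the OR of three bits, a single 1-bit certifies the value 1, while certifying the value 0 requires revealing all three bits.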

We will now define block sensitivity, another well studied complexity measure
for computing Boolean functions, as we will need it later in the paper.

Definition 6. A Boolean function $f$ is sensitive to a set $S \subseteq [n]$
on $x$ if $f(x) \ne f(x^{(S)})$, where $x^{(S)}$ is the input $x$ with the
bits in $S$ flipped to the opposite values. Then the block sensitivity of $f$
on $x$, $BS_f(x)$, is the size of the largest set of disjoint sets
$S_1, S_2, \ldots, S_k$ with the property that $f$ is sensitive to each set
$S_i$, $i \in [k]$, on $x$. Also, the block sensitivity of $f$ is the maximum
block sensitivity over all inputs $x$: $BS(f) = \max_{x \in X} BS_f(x)$.

The definition of average block sensitivity is natural and follows similarly
to the definition of average teaching dimension and average certificate
complexity. It is worth noting though that Definition 7 is the same as that
introduced in [4] (as other notions of average block sensitivity have been
studied).

Definition 7. The average block sensitivity of a Boolean function $f$ is
$aBS(f) = \frac{\sum_{x \in X} BS_f(x)}{|X|}$.
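
Block sensitivity can likewise be computed by exhaustive search for tiny functions. The sketch below assumes a function given as a callable on bit tuples; it enumerates all sensitive blocks and then searches for the largest family of pairwise disjoint ones, which is only feasible for a few variables. All names are illustrative.

```python
from itertools import product


def flip(x, block):
    """Return x with the bits whose positions are in `block` flipped."""
    return tuple(bit ^ 1 if i in block else bit for i, bit in enumerate(x))


def block_sensitivity(f, x):
    """Largest number of pairwise disjoint blocks to which f is sensitive on x."""
    n = len(x)
    # Enumerate every non-empty block S with f(x) != f(x^(S)).
    blocks = []
    for mask in range(1, 1 << n):
        block = frozenset(i for i in range(n) if (mask >> i) & 1)
        if f(flip(x, block)) != f(x):
            blocks.append(block)

    def best(start, used):
        # Either skip blocks[start] or take it if it is disjoint from the blocks taken so far.
        if start == len(blocks):
            return 0
        result = best(start + 1, used)
        if not (blocks[start] & used):
            result = max(result, 1 + best(start + 1, used | blocks[start]))
        return result

    return best(0, frozenset())


def average_block_sensitivity(f, n):
    inputs = list(product((0, 1), repeat=n))
    return sum(block_sensitivity(f, x) for x in inputs) / len(inputs)


def or3(x):
    """Illustrative function: the OR of three bits."""
    return int(any(x))
```

For the OR of three bits, the all-0 input has block sensitivity 3 (each single bit is a sensitive block) and every other input has block sensitivity 1, giving an average of (3 + 7) / 8 = 1.25.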

2.1 Connecting Learning and Computation 

If in Section 2 we set $X = \{0,1\}^{2^n}$ and we re-label $f$ as
$\mathcal{F}$, we can interpret $x \in X$ as Boolean functions
$f_i : \{0,1\}^n \to \{0,1\}$ (with each $x$ being a truth table and thus a
complete description of $f_i$). Thus $\mathcal{F}$ is a complete description
of a hypothesis class with $\mathcal{F}(f_i) = 1$ iff $f_i \in \mathcal{F}$.
The interaction protocol for both learning and evaluation proceeds in the same
manner: at each step, a teacher reveals the value of an unknown function $f$
on an input instance from $\{0,1\}^n$. This is the sense in which we connect
exact learning with a teacher and evaluation of Boolean functions in the
non-deterministic decision tree computation model.

To gain more intuition, if one imagines the function class as a matrix with
the rows being all elements in $\{0,1\}^{2^n}$ and the columns being the
inputs in $\{0,1\}^n$, then, fixing the interaction protocol to contain an
optimal teacher aware of the identity of a hidden row, learning is about
identifying the hidden row among the subset of rows that determine
$\mathcal{F}$ to be 1, while evaluation is about determining whether a hidden
row is part of a chosen subset of rows (that define $\mathcal{F}$) or not.

Another intuitive perspective is through the lens of hypergraphs: fixing a
vertex set, learning is about identifying a hidden edge from a set of edges
that form a hypergraph (the function class $\mathcal{F}$) while evaluation is
about determining whether a given subset of vertices is an edge in the
hypergraph or not.

2.2 Simple Examples

In this section we will describe bounds for several simple concept classes
with the goal of building intuition and exhibiting extreme values for $C^0$,
$C^1$ and $TD$.

Powerset. This class is a trivial example for which $\mathcal{F}_P \equiv 1$
(all functions defined on $\{0,1\}^n$ are part of the concept class). It is
easy to see that $TD(\mathcal{F}_P) = aTD(\mathcal{F}_P) = 2^n$ since to
locate a particular concept one needs to query all examples (otherwise there
will be at least two concepts that are identical on all previous instances).
And $C^0(f) = C^1(f) = 0$, $\forall f$, since $\mathcal{F}_P$ is constant.

Singletons. $\mathcal{F}_S = \{f_i\}_{i \in [2^n]}$, with $f_i(x) = 1$ iff
$x = x_i$ (i.e. the truth table of $f_i$ is the all-0 vector with the $i$-th
coordinate flipped to 1). Then $TD(\mathcal{F}_S) = aTD(\mathcal{F}_S) = 1$
because it is enough to show the 1-bit of the target function to uniquely
identify it among all $f_i$. Nevertheless, certifying that an
$f \in \{0,1\}^{2^n}$ is (or isn't) in $\mathcal{F}_S$ is hard in the worst
case. If nature chooses the all-0 function $f_0$ as a target, certifying that
$f_0$ is not part of $\mathcal{F}_S$ will require seeing all $2^n$ inputs as,
at any intermediate time, there will be at least one function in
$\mathcal{F}_S$ consistent with $f_0$. So $C^0(\mathcal{F}_S) = 2^n$.
Similarly $C^1(\mathcal{F}_S) = 2^n$.

Singletons with empty set.
$\mathcal{F}_{S\varepsilon} = \mathcal{F}_S \cup \{f_0\}$. For this function
class, teaching becomes hard, as teaching $f_0$ requires seeing all examples
to differentiate it from the other functions. So
$TD(\mathcal{F}_{S\varepsilon}) = 2^n$. Teaching the other functions is easy
though, as showing the 1-bit is enough to certify what the function is. So
$aTD(\mathcal{F}_{S\varepsilon}) < 2$. The 0-certificate is small as any
function not in $\mathcal{F}_{S\varepsilon}$ is evaluated to 1 for at least
two examples, so showing these two examples is enough to certify that the
function is not in $\mathcal{F}_{S\varepsilon}$ and thus
$C^0(\mathcal{F}_{S\varepsilon}) = aC^0(\mathcal{F}_{S\varepsilon}) = 2$. The
1-certificate is large though, since for any
$f \in \mathcal{F}_{S\varepsilon}$ at least $2^n - 1$ examples must be shown,
so $C^1(\mathcal{F}_{S\varepsilon}) = aC^1(\mathcal{F}_{S\varepsilon}) = 2^n - 1$.

The dictator function. $\mathcal{F}_D = \{f_i\}_{i \in [2^{2^n - 1}]}$, with
$f_i \in \mathcal{F}_D$ iff $f_i(x_j) = 1$ for some fixed
$x_j \in \{0,1\}^n$ (half of the Boolean functions defined on $\{0,1\}^n$ are
in $\mathcal{F}_D$). The learning problem is hard
($TD(\mathcal{F}_D) = aTD(\mathcal{F}_D) = 2^n - 1$) as it reduces to learning
a Powerset class on $2^n - 1$ bits. However, $C^0(f) = C^1(f) = 1$,
$\forall f$, since membership in $\mathcal{F}_D$ can be decided if the value
of the function on $x_j$ is revealed.
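
The values quoted for the four example classes can be reproduced by exhaustive search for $n = 2$ (so $|X| = 4$). The sketch below is an illustration under the assumption that functions are truth-table tuples over instances indexed 0..3; the helper names are hypothetical, and the dictator class arbitrarily fixes $x_j$ to be the instance with index 0.

```python
from itertools import combinations, product

N = 4  # |X| = 2^n with n = 2
TABLES = list(product((0, 1), repeat=N))  # all Boolean functions on X, as truth tables


def smallest(condition):
    """Size of a smallest subset S of X for which condition(S) holds."""
    for size in range(N + 1):
        for subset in combinations(range(N), size):
            if condition(subset):
                return size


def agree(f, g, subset):
    return all(f[x] == g[x] for x in subset)


def td(f, cls):
    return smallest(lambda s: not any(agree(f, g, s) for g in cls if g != f))


def cert(f, cls):
    """Membership certificate if f is in cls, non-membership certificate otherwise."""
    member = f in cls
    return smallest(lambda s: all((g in cls) == member for g in TABLES if agree(f, g, s)))


def summary(cls):
    """Return (TD, C^0, C^1) of the class, using 0 when a maximum ranges over nothing."""
    c0 = max((cert(f, cls) for f in TABLES if f not in cls), default=0)
    c1 = max((cert(f, cls) for f in cls), default=0)
    return max(td(f, cls) for f in cls), c0, c1


powerset = set(TABLES)
singletons = {t for t in TABLES if sum(t) == 1}
singletons_eps = singletons | {(0, 0, 0, 0)}
dictator = {t for t in TABLES if t[0] == 1}  # fixed instance x_j taken to be index 0
```

With $2^n = 4$, the triples (TD, $C^0$, $C^1$) returned by `summary` match the text: (4, 0, 0) for Powerset, (1, 4, 4) for Singletons, (4, 2, 3) for Singletons with empty set and (3, 1, 1) for the dictator class.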



3 Teaching and Certifying Membership 

In this section we will study connections between teaching a function in a
hypothesis class (conditioned on knowing that the function is indeed in the
class) and proving that the function is part of the class (with no prior
knowledge other than the knowledge of $\mathcal{F}$). We will begin with a
simple fact that is meant to illustrate the improvement in the next
subsection:

Fact 1. For any fixed instance space $X$, any function class $\mathcal{F}$ and
any $f \in \mathcal{F}$, $f : X \to \{0,1\}$,
$0 \le TD_{\mathcal{F}}(f) + C^1_{\mathcal{F}}(f) \le 2|X|$.

3.1 A Lower Bound Technique

The upper bound from Fact 1 is almost tight in the worst case: the example of
Singletons with empty set from Section 2.2 shows that the sum of the two
quantities can be $2|X| - 1$. The lower bound, on the other hand, is very
loose as the next theorem shows:

Theorem 2. For any fixed instance space $X$, any function class $\mathcal{F}$
and any $f \in \mathcal{F}$, $f : X \to \{0,1\}$,
$TD_{\mathcal{F}}(f) + C^1_{\mathcal{F}}(f) \ge |X|$.

Proof. Let's assume nature chooses $f_a \in \mathcal{F}$ as the target
hypothesis but does not reveal to the agent whether $f_a \in \mathcal{F}$ or
$f_a \notin \mathcal{F}$. Let $TS = TS_{\mathcal{F}}(f_a)$ be a minimal
teaching set for $f_a$ and let $CS$ be a smallest 1-certificate for $f_a$, so
that $|CS| = C^1_{\mathcal{F}}(f_a)$.

Let's assume that the goal of the teacher is to reveal the identity of $f_a$
to the learner using samples $(x, f_a(x))$. Since in this protocol the learner
is not aware of whether $f_a$ is evaluated to 1 or 0 by $\mathcal{F}$, to
uniquely identify $f_a$ it must see the value of the function on all
$x \in X$ (otherwise there will always be another function consistent with the
examples seen so far that is evaluated to the opposite value by
$\mathcal{F}$). This argument is equivalent to learning a function in the case
of the Powerset function class from Section 2.2.

Now let's describe an alternative strategy for the teacher that has the same
effect of uniquely identifying $f_a$. In the first epoch, the teacher reveals
all the samples from $CS$. This will 'certify' to any consistent learner that
$f_a \in \mathcal{F}$ (since no $f$ such that $\mathcal{F}(f) = 0$ is
consistent with $CS$). We are now in the standard exact learning setting where
the agent knows $f_a \in \mathcal{F}$ but does not know its identity. In the
second epoch, the teacher will reveal all the samples in $TS \setminus CS$ to
the learner, since there is no point in presenting elements from their
(possibly non-empty) intersection twice. This strategy will uniquely identify
$f_a$ without any prior knowledge about its membership in $\mathcal{F}$.

But we know that to uniquely identify $f_a$ we need exactly $|X|$ samples. So
it has to be the case that
$|CS \cup (TS \setminus CS)| = |CS \cup TS| = |X| \le |CS| + |TS| = TD_{\mathcal{F}}(f_a) + C^1_{\mathcal{F}}(f_a)$. □

Since the above relation holds for any function $f \in \mathcal{F}$ it must
hold for the average and worst case values of the complexity measures:

Corollary 1. For any $X$, $\mathcal{F}$, $f \in \mathcal{F}$,
$TD(\mathcal{F}) + C^1(\mathcal{F}) \ge |X|$ and
$aTD(\mathcal{F}) + aC^1(\mathcal{F}) \ge |X|$.
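
The pointwise bound of Theorem 2 is easy to confirm exhaustively on a tiny instance space. The following sketch enumerates every non-empty class over $|X| = 3$ (255 classes) and records the smallest value of $TD_{\mathcal{F}}(f) + C^1_{\mathcal{F}}(f)$; it assumes truth-table tuples as the representation and uses hypothetical helper names. It is a numerical illustration, not a substitute for the proof.

```python
from itertools import combinations, product

N = 3
TABLES = list(product((0, 1), repeat=N))


def smallest(condition):
    """Size of a smallest subset S of X for which condition(S) holds."""
    for size in range(N + 1):
        for subset in combinations(range(N), size):
            if condition(subset):
                return size


def agree(f, g, subset):
    return all(f[x] == g[x] for x in subset)


def td(f, cls):
    return smallest(lambda s: not any(agree(f, g, s) for g in cls if g != f))


def c1(f, cls):
    # S is a 1-certificate if every function agreeing with f on S belongs to the class.
    return smallest(lambda s: all(g in cls for g in TABLES if agree(f, g, s)))


def smallest_sum():
    """Minimum of TD_F(f) + C^1_F(f) over every non-empty class F and every f in F."""
    best = None
    for mask in range(1, 1 << len(TABLES)):
        cls = {t for i, t in enumerate(TABLES) if (mask >> i) & 1}
        for f in cls:
            total = td(f, cls) + c1(f, cls)
            best = total if best is None else min(best, total)
    return best
```

The minimum equals $|X| = 3$, so the bound is never violated and is attained (for instance by the Powerset class, where $TD = 3$ and $C^1 = 0$).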

3.2 Certifying Membership is (Usually) Hard 

In this section we will present a first application of Theorem 2. We will show
that, for some of the standard concept classes in learning theory, certifying
membership is hard, meaning all input variables need to be queried to
determine whether an unknown function is part of the function class.

In Table 1 we present known lower and upper bounds for TD and aTD for a few
hypothesis classes encountered in learning theory. It is important to note
that aTD scales at most logarithmically with the size of the input space
(constant for $k$-juntas and logarithmic for (monotone) conjunctions,
(monotone) DNFs, LTFs). Thus, from Corollary 1, for such concept classes
$\mathcal{F}$ with $aTD(\mathcal{F}) = o(|X|)$, it follows that:
$C^1(\mathcal{F}) \ge aC^1(\mathcal{F}) \ge |X| - aTD(\mathcal{F}) = \Omega(|X|)$.

Table 1. Previous Results ($|\mathcal{F}| = m$, $\mathcal{F}(f) = 1$ iff
$f \in \mathcal{F}$, $f : \{0,1\}^n \to \{0,1\}$)

Class                 $TD(\mathcal{F})$                               $aTD(\mathcal{F})$    $aC^1(\mathcal{F})$
Monotone Monomials    $\Theta(n)$ [6]                                 $O(n)$ [11]           $\Omega(2^n)$
Monomials             $\Theta(n)$ [6]                                 $O(n)$ [11]           $\Omega(2^n)$
Monotone k-term DNF   $n^k + k$ [11]                                  $O(kn)$ [11]          $\Omega(2^n)$
k-term DNF            -                                               $O(kn)$ [11]          $\Omega(2^n)$
LTF                   $O(2^n)$ [2]                                    $[n+1, n^2]$ [2]      -
k-Juntas              $O(k 2^k \log n)$, $\Omega(2^k \log n)$ [11]    $O(2^k)$ [11]         $\Omega(2^n)$



3.3 Sparse Boolean Functions are Hard to Compute 

In this section we will present another application of Theorem 2. We will show
that 'sparse' Boolean functions have large certificate complexity. By sparse
we mean Boolean functions with a 'small' (roughly the size of $|X|$) Hamming
weight (number of 1's) in the output truth table. The following theorem makes
this precise:

Theorem 3. For any sets $X$ and $Y = 2^X$ and any Boolean function
$\mathcal{F} : Y \to \{0,1\}$, let
$m = |\{f \in Y \mid \mathcal{F}(f) = 1\}|$. If $m = o(|X|^2)$, then
$C^1(\mathcal{F}) = \Omega(|X|)$.

Proof. We can interpret $\mathcal{F}$ as a function class in the same manner
as described in Section 2.1. Due to Theorem 1 from [10] we know there is a
general upper bound $aTD(\mathcal{F}) = O(\sqrt{m}) = o(|X|)$. Then
$C^1(\mathcal{F}) \ge aC^1(\mathcal{F}) \ge |X| - aTD(\mathcal{F}) = |X| - o(|X|) = \Omega(|X|)$. □

It is worth mentioning that while the teaching dimension is used in the proof,
the result is strictly about the hardness of computing $\mathcal{F}$.

4 Teaching and Certifying Non-Membership 

In this section we will study bounds and relations between learning a function
in a class and proving that the function is not in the class. The high-level
intuition of why the two problems are similar comes from the similarity with
the problem of searching for an element in an unordered / ordered set: in the
worst case, searching for an element that is part of the set is as hard as
searching for the element if it is not in the set. We will present a set of
results that are a first indication that, at least in the average case, the
learning problem and the non-membership decision problem are similar.

We will first deal with the worst case for $C^0$.

Theorem 4. For any instance space $X$, any function class $\mathcal{F}$ with
$m = |\mathcal{F}| \le |X|$, and any $f \notin \mathcal{F}$,
$C^0_{\mathcal{F}}(f) \le m$.

Proof. From Bondy's theorem ([9], Theorem 12.1), we know there exists a set of
coordinates $TS = TS^*(\mathcal{F})$ of size $m - 1$ that, when revealed,
uniquely identifies any target function $f \in \mathcal{F}$. Let's choose
$f_a \in \complement\mathcal{F}$.

Let's first assume $\exists f \in \mathcal{F}$ that is consistent with $f_a$
on $TS$. If we reveal the labels for all the coordinates in $TS$, $f$ will be
the only function in $\mathcal{F}$ consistent with $f_a$. It must then be the
case that revealing just one more coordinate (one on which $f$ and $f_a$
differ) will lead to a certificate of size $m$ for $f_a$. If, on the other
hand, every $f \in \mathcal{F}$ is inconsistent with $f_a$ on $TS$, then
revealing the labels of all the coordinates in $TS$ is itself a certificate,
of size $m - 1$, that $\mathcal{F}(f_a) = 0$. □

The following corollary follows immediately:

Corollary 2. For any function class $\mathcal{F}$, with
$m = |\mathcal{F}| \le |X|$, $C^0(\mathcal{F}) \le m$.

The bound on $C^0(\mathcal{F})$ is tight, as the example of Singletons from
Section 2.2 demonstrates, which thus settles the worst case bounds for $C^0$.
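
Both ingredients of the argument, the existence of at most $m - 1$ distinguishing coordinates and the bound $C^0_{\mathcal{F}}(f) \le m$, can be checked exhaustively for $|X| = 4$ and every class with at most four members. The sketch assumes truth-table tuples and uses illustrative helper names; it is a finite sanity check only.

```python
from itertools import combinations, product

N = 4
TABLES = list(product((0, 1), repeat=N))


def agree(f, g, subset):
    return all(f[x] == g[x] for x in subset)


def c0(f, cls):
    """Smallest number of coordinates on which f differs from every member of cls."""
    for size in range(N + 1):
        for subset in combinations(range(N), size):
            if not any(agree(f, g, subset) for g in cls):
                return size


def distinguishing_set_size(cls):
    """Smallest number of coordinates on which all members of cls are pairwise distinct."""
    for size in range(N + 1):
        for subset in combinations(range(N), size):
            if len({tuple(g[x] for x in subset) for g in cls}) == len(cls):
                return size


def check(max_m=N):
    """Verify both bounds for every class with at most max_m members."""
    for m in range(1, max_m + 1):
        for cls in combinations(TABLES, m):
            if distinguishing_set_size(cls) > m - 1:
                return False
            if any(c0(f, cls) > m for f in TABLES if f not in cls):
                return False
    return True
```

Calling `check()` walks through all 2516 such classes and returns `True` when neither bound is ever exceeded.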



4.1 Bounds for aC° 

We will begin by proving a lower bound for $aC$. The result is a simple
application of a theorem from [4] that puts a lower bound on the average block
sensitivity of a Boolean function.

Theorem 5. For any set $X$, there exists a function
$\mathcal{F}_a : 2^X \to \{0,1\}$ such that
$aC(\mathcal{F}_a) \ge \sqrt{|X|}$.

Proof. We will choose $\mathcal{F}_a$ to be the Rubinstein function (more
details in the proof of Theorem 6) and apply the result from Proposition 6 in
[4] that $aBS(\mathcal{F}_a) \ge \sqrt{|X|}$. But, for any input
$x \in 2^X$, $BS_{\mathcal{F}_a}(x) \le C_{\mathcal{F}_a}(x)$, since a
certificate for an input must contain at least one bit from each sensitive
block, otherwise the value of the function can be flipped by an adversary (see
[5], Proposition 1). And so the relation must hold for the average case as
well, since
$\sqrt{|X|} \le aBS(\mathcal{F}_a) = \frac{\sum_{x} BS_{\mathcal{F}_a}(x)}{2^{|X|}} \le \frac{\sum_{x} C_{\mathcal{F}_a}(x)}{2^{|X|}} = aC(\mathcal{F}_a)$. □

Now we will prove that the same property holds even if we restrict our
attention to $aC^0$, which requires more work.

Theorem 6. For any set $X$, there exists a function
$\mathcal{F}_a : 2^X \to \{0,1\}$ such that
$aC^0(\mathcal{F}_a) = \Omega(\sqrt{|X|})$.

Proof. Let $|X| = 4k^2$. We will choose again $\mathcal{F}_a$ to be the
Rubinstein function on $X$ and we define it as follows: the $4k^2$ variables
are partitioned in $2k$ pieces of size $2k$ each, and $\mathcal{F}_a$ is 1 iff
there exists at least one piece of the partition that has 2 consecutive
variables equal to 1 and the rest 0. We will count the number of inputs that
are evaluated to 0, i.e. $|X^0|$, with
$X^0 = \{x \in 2^X \mid \mathcal{F}_a(x) = 0\}$.

If we fix a piece of the partition, there are $2k$ configurations of the input
variables in that piece that lead to $\mathcal{F}_a$ being evaluated to 1. And
so $2^{2k} - 2k$ configurations don't "contribute" to making $\mathcal{F}_a$
equal to 1. But, if in each piece there is a configuration that doesn't
"contribute" to $\mathcal{F}_a = 1$, then $\mathcal{F}_a = 0$. There are thus
$(2^{2k} - 2k)^{2k}$ such configurations.

We know from Theorem 5 that:

\[
\begin{aligned}
2k = \sqrt{|X|} \le aC(\mathcal{F}_a)
  &= \frac{\sum_{x \in X^1} C(x) + \sum_{x \in X^0} C(x)}{2^{4k^2}} \\
  &\le \frac{\left(2^{4k^2} - (2^{2k} - 2k)^{2k}\right) 4k^2 + (2^{2k} - 2k)^{2k}\, aC^0(\mathcal{F}_a)}{2^{4k^2}} \\
  &= \left(1 - \left(1 - \frac{2k}{2^{2k}}\right)^{2k}\right) 4k^2 + \left(1 - \frac{2k}{2^{2k}}\right)^{2k} aC^0(\mathcal{F}_a)
\end{aligned}
\]

where the second inequality follows by upper bounding all $C(x)$,
$x \in X^1$, by the maximum possible certificate complexity, i.e. the size of
$X$, $4k^2$. It can then be shown that
$\lim_{k \to \infty} \left(1 - \left(1 - \frac{2k}{2^{2k}}\right)^{2k}\right) 4k^2 = 0$
and $\lim_{k \to \infty} \left(1 - \frac{2k}{2^{2k}}\right)^{2k} = 1$, and
thus $aC^0(\mathcal{F}_a) = \Omega(2k)$. □
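
The counting step of the proof can be checked numerically for $k = 2$ (16 variables). The sketch below makes one assumption that the text leaves implicit: "consecutive" is read cyclically within a piece, which is what yields exactly $2k$ good configurations per piece when $k \ge 2$. Function names are illustrative.

```python
from itertools import product


def good_piece(piece):
    """True iff exactly two cyclically adjacent variables of the piece are 1 and the rest 0."""
    size = len(piece)
    return sum(piece) == 2 and any(piece[i] and piece[(i + 1) % size] for i in range(size))


def rubinstein(x, k):
    """Value of the function on an input x of 4k^2 bits, split into 2k pieces of size 2k."""
    size = 2 * k
    return int(any(good_piece(x[p * size:(p + 1) * size]) for p in range(size)))


def count_zero_inputs(k):
    """Brute-force count of the inputs evaluated to 0 (feasible only for very small k)."""
    return sum(1 - rubinstein(x, k) for x in product((0, 1), repeat=4 * k * k))


def zero_fraction(k):
    """Closed form (1 - 2k / 2^(2k))^(2k) for the fraction of inputs evaluated to 0."""
    return (1 - 2 * k / 2 ** (2 * k)) ** (2 * k)
```

For $k = 2$ the brute-force count agrees with $(2^{2k} - 2k)^{2k} = 12^4 = 20736$, and evaluating `zero_fraction` for growing $k$ shows the weight of the $aC^0$ term approaching 1, in line with the limits used at the end of the proof.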

Before we continue, it is interesting to remark on the similarity (at a high
level) of the result from Theorem 6 with Theorem 1 from [10], which describes
a lower bound of $\sqrt{|X|}$ on $aTD(\mathcal{F})$. The lower bound tools are
very different (Rubinstein function for $aC^0$ and projective planes for
$aTD$), but, since they lead to an identical lower bound for related
complexity measures, it would be interesting to see if there are some deep
connections between them.

We will now prove a (weak) lower bound tool for the relationship between
$aTD$ and $aC^0$.

Theorem 7. For any $\alpha < 2$, there exists a function class
$\mathcal{F}_\alpha$ such that
$aTD(\mathcal{F}_\alpha) + aC^0(\mathcal{F}_\alpha) \ge \alpha|X|$.

The theorem states that $aTD$ and $aC^0$ can be simultaneously "large"
($\Theta(|X|)$), a statement that is not immediately obvious (at least given
the simple concept classes considered in Section 2.2).



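Proof. Let's consider $\mathcal{F}_C = 1 - \mathcal{F}_{S\varepsilon}$ (the
complement of the 'Singletons with empty set' concept class). Then
$aTD(\mathcal{F}_C) = aTD(\mathcal{F}_{S\varepsilon}) = |X| - 1$. Also
$aC^0(\mathcal{F}_C) = aC^1(\mathcal{F}_{S\varepsilon}) = |X| - 1$. So
$aTD(\mathcal{F}_C) + aC^0(\mathcal{F}_C) = 2|X| - 2$ and thus, for any
$\alpha < 2$, the sum is at least $\alpha|X|$ once $|X|$ is large enough. □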



5 Connection with Membership Query Learning 

For the purpose of this section we will focus on learning and computing with
queries (the membership query learning and deterministic decision tree
computation models) as this perspective will allow us to get more intuition
about the structure of the function $\mathcal{F}$.

In a manner similar to the way we have defined teaching dimension for the
protocol of exact learning with a teacher, we will define $MEMB(\mathcal{F})$
to be the (worst case) optimal learning bound for learning a function class
$\mathcal{F}$ in the exact learning with membership queries protocol. Also, in
a similar manner as for the certificate complexity definition in the
non-deterministic decision tree model, we define $D(\mathcal{F})$ to be the
(worst case) optimal complexity of computing a Boolean function $\mathcal{F}$
in the deterministic decision tree model (for more formal definitions see [1]
and [5]).

In [7] Hegedus introduced a complexity measure for bounding
$MEMB(\mathcal{F})$ called the Extended Teaching Dimension, which, as the name
suggests, is inspired by the definition of the Teaching Dimension. We will
define this complexity measure and then describe a result that establishes a
connection between $TD(\mathcal{F})$, $C^0(\mathcal{F})$ and
$MEMB(\mathcal{F})$.

Definition 8 ([7]). A set $S \subseteq X$ is a specifying set (SPS) for an
arbitrary concept $f \in 2^X$ with respect to the hypothesis class
$\mathcal{F}$ if there is at most one concept in $\mathcal{F}$ that is
consistent with $f$ on $S$. Then the Extended Teaching Dimension (ETD) of
$\mathcal{F}$ is the minimal integer $k$ such that there exists a specifying
set of size at most $k$ for any concept $f \in 2^X$.

Theorem 8. For any function class $\mathcal{F}$,
$\max\{TD(\mathcal{F}), C^0(\mathcal{F})\} \le ETD(\mathcal{F}) \le \max\{TD(\mathcal{F}), C^0(\mathcal{F})\} + 1$.

Proof. A specifying set for any $f \in \mathcal{F}$ is also a teaching set for
$f$, as it uniquely identifies the function among all other functions in
$\mathcal{F}$. Also, a specifying set for any
$f_c \in \complement\mathcal{F}$ is 'almost' a certificate that $f_c$ is not
in $\mathcal{F}$, as it differentiates $f_c$ from all other functions in
$\mathcal{F}$ with the exception of at most one function. Revealing an extra
instance is thus sufficient to differentiate $f_c$ from all
$f \in \mathcal{F}$ and thus to obtain a certificate for
$f_c \in \complement\mathcal{F}$.

Let $ETD(\mathcal{F}) = k$ for some fixed $k$. Then there must be at least one
function $f_a \in 2^X$ that has a minimal specifying set of size exactly $k$.
Let's assume, wlog, that such a function is unique. Let's first consider the
case of $f_a \in \mathcal{F}$. Since $ETD(\mathcal{F}) = k$, it means all
$f \in \mathcal{F}$ have a teaching set of size $\le k$. $f_a$ can't have a
teaching set with a size smaller than $k$, since such a teaching set would
also be a specifying set of size $< k$, which is not possible given the
assumption of uniqueness. And since $TD_{\mathcal{F}}(f_a) = k$ it means
$TD(\mathcal{F}) = k$. Now let's pick an arbitrary
$f_c \in \complement\mathcal{F}$. Since $f_a$ is the unique function with a
specifying set of size $k$, it means that $|SPS_{\mathcal{F}}(f_c)| < k$ and
thus $C^0(f_c) \le |SPS_{\mathcal{F}}(f_c)| + 1 \le k$, which thus proves the
desired relation for this case.

The second case with $f_a \in \complement\mathcal{F}$ is treated similarly. □
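
The quantities in Definition 8 can be computed by exhaustive search on a tiny instance space. The sketch below, over $|X| = 3$, checks only the two relations argued in the first paragraph of the proof ($TD \le ETD$, and $C^0 \le ETD + 1$ because one extra instance turns a specifying set into a certificate) together with the upper bound stated in Theorem 8. The representation (truth-table tuples), the helper names and the convention $C^0 = 0$ for a class with empty complement are assumptions made for illustration.

```python
from itertools import combinations, product

N = 3
TABLES = list(product((0, 1), repeat=N))


def smallest(condition):
    """Size of a smallest subset S of X for which condition(S) holds."""
    for size in range(N + 1):
        for subset in combinations(range(N), size):
            if condition(subset):
                return size


def consistent(f, cls, subset):
    """Members of cls that agree with f on every instance in `subset`."""
    return [g for g in cls if all(f[x] == g[x] for x in subset)]


def td(cls):
    return max(smallest(lambda s: consistent(f, cls, s) == [f]) for f in cls)


def c0(cls):
    sizes = (smallest(lambda s: not consistent(f, cls, s)) for f in TABLES if f not in cls)
    return max(sizes, default=0)


def etd(cls):
    # A specifying set leaves at most one member of cls consistent with f.
    return max(smallest(lambda s: len(consistent(f, cls, s)) <= 1) for f in TABLES)


def check_all_classes():
    """Check TD <= ETD, C^0 <= ETD + 1 and ETD <= max(TD, C^0) + 1 for every class."""
    for mask in range(1, 1 << len(TABLES)):
        cls = [t for i, t in enumerate(TABLES) if (mask >> i) & 1]
        t, c, e = td(cls), c0(cls), etd(cls)
        if not (t <= e and c <= e + 1 and e <= max(t, c) + 1):
            return False
    return True
```

The search covers all 255 non-empty classes over three instances; larger instance spaces quickly become infeasible with this approach.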

Theorem 8 directly leads to the following corollary (since
$ETD(\mathcal{F})$ is a lower bound for $MEMB(\mathcal{F})$) stating that
learning with membership queries is at least as hard as certifying
non-membership:

Corollary 3. For any function class $\mathcal{F}$,
$C^0(\mathcal{F}) \le MEMB(\mathcal{F})$.

5.1 Is $\mathcal{F}$ weakly symmetric for natural learning problems?

We will now give a result that follows from Theorem [2] and connects learning 
and computation in the query model: 

Corollary 4. For any instance space $X$, any fixed function class $\mathcal{F}$ and any
$f \in \mathcal{F}$, $f : X \to \{0, 1\}$, $MEMB_{\mathcal{F}}(f) + D_{\mathcal{F}}(f) \ge |X|$.

The proof is immediate: the teaching dimension is a lower bound for the
optimal membership query bound [6] and the certificate complexity is a lower
bound for the decision tree complexity [5], so we can just apply Theorem 2
to get the desired relation.
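The inequality between the two lower bounds can be checked exhaustively on small classes. The sketch below is a minimal illustration under stated assumptions: Theorem 2 is taken to bound the sum of the teaching set size and the certificate size by $|X|$ from below, and a certificate for $\mathcal{F}(f) = 1$ is taken to be a set of instances $S$ such that every function agreeing with $f$ on $S$ belongs to $\mathcal{F}$. The helper names and the two example classes are hypothetical.

```python
from itertools import combinations, product


def smallest_subset_size(m, accept):
    """Size of a smallest subset S of {0, ..., m-1} for which accept(S) holds."""
    for k in range(m + 1):
        for S in combinations(range(m), k):
            if accept(S):
                return k
    raise ValueError("no subset is accepted")


def agrees(g, f, S):
    """True if g and f have the same label on every point of S."""
    return all(g[x] == f[x] for x in S)


def teaching_size(F, f, m):
    # Every other function of F must disagree with f somewhere on S.
    return smallest_subset_size(
        m, lambda S: all(g == f or not agrees(g, f, S) for g in F)
    )


def membership_certificate_size(F, f, m):
    # Assumed definition: every function agreeing with f on S lies in F.
    everything = list(product((0, 1), repeat=m))
    return smallest_subset_size(
        m, lambda S: all(g in F for g in everything if agrees(g, f, S))
    )


def sum_bound_holds(F, m):
    """Check teaching size + certificate size >= |X| for every f in F."""
    return all(
        teaching_size(F, f, m) + membership_certificate_size(F, f, m) >= m
        for f in F
    )


# Hypothetical examples over |X| = 4 points, ordered as 00, 01, 10, 11:
# the two monotone monomials x1 and x2, and the class of all functions.
two_monomials = {(0, 0, 1, 1), (0, 1, 0, 1)}
all_functions = set(product((0, 1), repeat=4))

assert sum_bound_holds(two_monomials, 4)
assert sum_bound_holds(all_functions, 4)
```

Since the teaching set size and the certificate size lower-bound the membership query and decision tree measures respectively, the same inequality carries over to the quantities in Corollary 4.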

A natural question is how useful this bound is for standard concept classes
from learning theory. We address this question in this subsection, where
we describe an interesting structural property of $\mathcal{F}$.

We will begin with a few (informal) definitions (see [12] for a complete
reference). In the deterministic decision tree computation model, a Boolean
function is called evasive if, in the worst case, all input variables need to be
queried to determine the value of the function. Several results describe
sufficient conditions for large classes of Boolean functions to be evasive. An
interesting class of Boolean functions is that of graph properties. A graph
property is a class of graphs (on a fixed number of vertices) that remains
unchanged under any permutation of the vertices (graph connectivity, for
example). The variables of a graph property are the possible edges of a graph.

By construction, a graph property can be encoded as a weakly symmet- 
ric Boolean function on the edges. Weakly symmetric Boolean functions are a 
generalization of symmetric Boolean functions. A Boolean function is weakly 
symmetric if, for any pair of variables, there exists a permutation of all vari- 
ables that permutes the variables in the pair, such that the function remains 
unchanged. 
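For small numbers of variables, weak symmetry can be tested directly from this definition. The sketch below reads the definition as: for every pair of variables $(i, j)$ there is a permutation of the variables that sends $i$ to $j$ and leaves the function unchanged (this reading, the helper names, and the example functions are assumptions made for illustration). It enumerates all $n!$ permutations, so it is only usable for very small $n$.

```python
from itertools import permutations, product


def is_invariant(f, n, perm):
    """True if f takes the same value on x and on x with its variables permuted."""
    return all(
        f(x) == f(tuple(x[perm[i]] for i in range(n)))
        for x in product((0, 1), repeat=n)
    )


def is_weakly_symmetric(f, n):
    """True if the invariance-preserving permutations act transitively."""
    autos = [p for p in permutations(range(n)) if is_invariant(f, n, p)]
    return all(
        any(p[i] == j for p in autos) for i in range(n) for j in range(n)
    )


def cyclic_pairs(x):
    """1 if two cyclically adjacent variables are both 1 (invariant under shifts)."""
    n = len(x)
    return int(any(x[i] and x[(i + 1) % n] for i in range(n)))


def dictator(x):
    """Depends only on the first variable, so variables are not interchangeable."""
    return x[0]


def majority3(x):
    """Symmetric function on three variables."""
    return int(sum(x) >= 2)


assert is_weakly_symmetric(cyclic_pairs, 4)
assert is_weakly_symmetric(majority3, 3)
assert not is_weakly_symmetric(dictator, 4)
```

The function `cyclic_pairs` is weakly symmetric without being symmetric: it is invariant under cyclic shifts but changes when only two adjacent variables are exchanged.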

Graph properties are weakly symmetric since all permutations of the vertex
set induce a set of permutations of the edge set which leave the function
unchanged. A general hardness result (the Rivest-Vuillemin theorem [13]) for
computing weakly symmetric Boolean functions (and implicitly graph
properties) states that any non-constant weakly symmetric Boolean function defined
on a number of variables that is a power of a prime number is evasive.
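Evasiveness can be verified by brute force for small functions. The sketch below computes the deterministic decision tree complexity by exhaustive recursion over partial assignments and compares it with the number of variables. The helper names and the two example functions are assumptions made for illustration; the first example is a weakly symmetric function on $4 = 2^2$ variables, the second is a multiplexer that never needs all of its inputs.

```python
from functools import lru_cache
from itertools import product


def decision_tree_depth(f, n):
    """Worst-case number of queries made by an optimal deterministic decision tree."""

    @lru_cache(maxsize=None)
    def depth(partial):
        # partial[i] is 0, 1, or None when variable i has not been queried yet.
        free = [i for i in range(n) if partial[i] is None]
        values = set()
        for bits in product((0, 1), repeat=len(free)):
            x = list(partial)
            for i, b in zip(free, bits):
                x[i] = b
            values.add(f(tuple(x)))
        if len(values) == 1:
            return 0  # the answers so far already determine f
        # Query the best variable, assuming the worst answer.
        return 1 + min(
            max(depth(partial[:i] + (b,) + partial[i + 1:]) for b in (0, 1))
            for i in free
        )

    return depth((None,) * n)


def is_evasive(f, n):
    """True if every decision tree for f queries all n variables on some input."""
    return decision_tree_depth(f, n) == n


def cyclic_pairs(x):
    """1 if two cyclically adjacent variables are both 1."""
    n = len(x)
    return int(any(x[i] and x[(i + 1) % n] for i in range(n)))


def multiplexer(x):
    """x[0] selects which of x[1], x[2] is returned."""
    return x[1] if x[0] else x[2]


assert is_evasive(cyclic_pairs, 4)
assert not is_evasive(multiplexer, 3)
```

The recursion visits at most $3^n$ partial assignments, which keeps it practical only for a handful of variables.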

This brings us to the point of connection with the Boolean function $\mathcal{F}$ as
we've defined it in Section 2.1. The intuition is that, in the same way that
permuting vertices doesn't change a graph property, permuting the input
variables of the $f$'s doesn't change the Boolean function $\mathcal{F}$; in other words,
the definition of a function class ($\mathcal{F}$) is invariant to permutations of
$x_a^{(i)}$ (the bits of $x_a \in X$). Moreover, by construction, $\mathcal{F}$ has a number of
inputs which is a power of a prime ($2^n$, all the possible inputs that can be
defined on the original $n$ input variables) and natural concept classes are not
trivial.

So, if and when the intuition that $\mathcal{F}$ is weakly symmetric is correct, we can
actually apply the aforementioned result to show that $\mathcal{F}$ is evasive and in turn
that $D(\mathcal{F}) = \Omega(|X|)$ (and implicitly that $C(\mathcal{F}) = \Omega(\sqrt{|X|})$). In such
situations the bound from Theorem 4 is not very useful as it puts no constraints
on the optimal membership query bound.

Interestingly though, the above intuition is false in general. For example, the
following theorem shows a natural concept class that leads to a function $\mathcal{F}$ that
is not weakly symmetric.

Theorem 9. If $m\mathcal{F}_k$ is the class of monotone monomials of size exactly$^2$ $k$,
$m\mathcal{F}_k$ (viewed as a Boolean function with input 'bits' from $X$ and inputs from
$2^X$) is not weakly symmetric.

Proof. Let $X = \{0, 1\}^n$ and for any $x \in X$ let $|x| = |\{i \mid x^{(i)} = 1\}|$ be the
weight of $x$ (the number of bits in $x$ that are 1).

Let's consider $f_A \in m\mathcal{F}_k$ and $x_a \in X$ such that $f_A(x_a) = 1$. Then it must
be the case that $|x_a| \ge k$ (there exist $k$ bits among the $n$ bits of $x_a$ that are 1).
Let $x_b$ be an input in $X$ such that $|x_b| < k$. Then $f_A(x_b) = 0$, since $x_b$ has
fewer than $k$ bits set to 1 and so can't satisfy a monotone monomial of size $k$.

Now let's consider an arbitrary permutation $\pi$ that exchanges $x_a$ with $x_b$. This
means that it will induce a Boolean function $f^{\pi}$ that evaluates to 1 on
$x_b$. But such a function can't be a monotone monomial of size exactly $k$, since
$|x_b| < k$ and so no such monomial can evaluate to 1 on $x_b$. This means that any
permutation that exchanges $x_a$ and $x_b$ will change the function $m\mathcal{F}_k$. So we have
found a pair of variables for which no permutation of the variables that
exchanges the two leaves $m\mathcal{F}_k$ unchanged. Thus $m\mathcal{F}_k$ can't be weakly symmetric. □
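The argument can be confirmed by exhaustive search for very small $n$. The sketch below is illustrative and rests on assumptions: functions are encoded as tuples of labels over the $2^n$ points of $X$, a permutation of $X$ is taken to leave the class function unchanged exactly when it maps the class onto itself, and weak symmetry is read as transitivity of these permutations on the points. The helper names are hypothetical, and the search over all $(2^n)!$ permutations is only feasible for $n \le 3$.

```python
from itertools import combinations, permutations, product


def monotone_monomials(n, k):
    """Instance space {0,1}^n and the class of monotone monomials with k variables."""
    X = list(product((0, 1), repeat=n))
    F = {
        tuple(int(all(x[i] for i in T)) for x in X)
        for T in combinations(range(n), k)
    }
    return X, F


def class_preserving_permutations(F, m):
    """Permutations of the m points that map the class F onto itself."""
    return [
        p
        for p in permutations(range(m))
        if {tuple(f[p[j]] for j in range(m)) for f in F} == F
    ]


def is_weakly_symmetric_class(F, m):
    """True if, for every pair of points, some class-preserving permutation sends one to the other."""
    autos = class_preserving_permutations(F, m)
    return all(
        any(p[a] == b for p in autos) for a in range(m) for b in range(m)
    )


X, F = monotone_monomials(2, 1)
assert not is_weakly_symmetric_class(F, len(X))
```

In this small case the all-zeros and all-ones points are fixed by every class-preserving permutation, which matches the weight argument in the proof.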

For such a concept class there is thus hope that a relation like the one from
Theorem 4 might be useful. However, there are interesting concept classes that
lead to a weakly symmetric function $\mathcal{F}$. An example is the class of monomials
of size exactly $k$:

Theorem 10. If $\mathcal{F}_k$ is the class of monomials of size exactly $k$, $\mathcal{F}_k$ is weakly
symmetric.

Proof. Let $X = \{0, 1\}^n$ and $V$ the extended set of $2n$ variables indexed in $[2n]$
that contains the variables and their complements: $V = \{y^{(i)} \mid i \in [2n]\}$ with
$y^{(i)} = x^{(i)}$ for $i \in [n]$ and $y^{(i)} = \neg x^{(i-n)}$ for $i \in [n + 1, 2n]$.

Let's fix $x_a, x_b \in X$ and let $I = \{i \mid x_a^{(i)} = x_b^{(i)}\}$ be the set of variables that
have identical values for $x_a$ and $x_b$ and $D = \{i \mid x_a^{(i)} \ne x_b^{(i)}\}$, the complement of $I$.
We will construct a permutation $\sigma^{(a,b)}$ over $V$ based on $I$ and $D$ that will induce
a permutation $\pi_\sigma$ over $X$ which will in turn induce a permutation $\Pi_\sigma$ over $2^X$.
$\Pi_\sigma$ will have the property that $\mathcal{F}_k(\Pi_\sigma(f)) = \mathcal{F}_k(f)$ for all $f$, i.e. the
permutation that $\sigma^{(a,b)}$ induces on the set of possible functions $f$ leaves $\mathcal{F}_k$
unchanged, which is what we need to show.

For any $i \in I$, let $\sigma^{(a,b)}(i) = i$ and for any $i \in D$, let $\sigma^{(a,b)}(i) = i + n$. In other
words, any variable on which $x_a$ and $x_b$ agree will remain unchanged, while any
variable for which there is disagreement will be negated. From the construction
of $\sigma$ it follows that $\pi_\sigma(x_a) = x_b$ and $\pi_\sigma(x_b) = x_a$, as desired (where, as above,
$\pi_\sigma$ is the permutation induced by $\sigma$ on $X$).

$^2$ A Boolean function is representable by a monomial of size exactly $k$ if it has a
monomial representation of size $k$ and no monomial representation of size $k'$ for any $k' < k$.

We will first consider $f$'s such that $\mathcal{F}_k(f) = 1$, which means $f$ is a monomial
of size $k$. In the expression of $f$, $\sigma$ will either leave a variable unchanged or it
will replace it with its negation. But that means that $\Pi_\sigma(f)$ (where, as above,
$\Pi_\sigma$ is the induced permutation over $2^X$) will still be a monomial of size exactly
$k$ (albeit a different one), so $\mathcal{F}_k(\Pi_\sigma(f)) = 1$.

The second case considers functions $f$ such that $\mathcal{F}_k(f) = 0$. Let $s$ be the
number of terms in the minimal DNF representation of $f$. If $s = 1$ then $f$ is
representable by a monomial and, since $\mathcal{F}_k(f) = 0$, $f$ is a $k'$-monomial with
$k' < k$ or $k' > k$. But negating any subset of variables from $f$ will not increase
or decrease the number of variables in the conjunction (as the variables are
uniquely represented in the conjunction and the expression can't be reduced in
any way), so $\mathcal{F}_k(\Pi_\sigma(f)) = 0$ for $s = 1$.

If $s > 1$, let's assume that $\Pi_\sigma(f)$ has an $s'$-term DNF representation for some
$s' < s$. But this means that $\Pi_\sigma(\Pi_\sigma(f))$ (we apply the induced permutation
$\Pi_\sigma$ a second time) will also have an $s'$-term DNF representation. But since
$\Pi_\sigma(\Pi_\sigma(f)) = f$ (as applying $\Pi_\sigma$ two times only doubly negates a subset of the
variables), we get a contradiction with the minimality of $s$. So $\Pi_\sigma(f)$ has an
$s$-term minimal DNF representation with $s > 1$ and can't be a monomial. Thus
$\mathcal{F}_k(\Pi_\sigma(f)) = 0$. □
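The construction in the proof can be exercised on small cases. The sketch below is a minimal illustration under assumptions: the permutation induced on $X$ by negating the variables in $D$ is implemented as flipping the bits on which $x_a$ and $x_b$ differ, functions are encoded as tuples of labels over $X$, and leaving $\mathcal{F}_k$ unchanged is checked as mapping the class onto itself. All helper names are hypothetical.

```python
from itertools import combinations, product


def monomials(n, k):
    """Instance space {0,1}^n and the class of conjunctions of k literals
    over k distinct variables (negations allowed)."""
    X = list(product((0, 1), repeat=n))
    F = set()
    for variables in combinations(range(n), k):
        for signs in product((0, 1), repeat=k):  # 1: positive literal, 0: negated
            F.add(
                tuple(
                    int(all(x[i] == s for i, s in zip(variables, signs)))
                    for x in X
                )
            )
    return X, F


def induced_point_permutation(X, xa, xb):
    """Permutation of X obtained by negating every variable on which xa and xb differ."""
    D = {i for i in range(len(xa)) if xa[i] != xb[i]}
    index = {x: j for j, x in enumerate(X)}
    return [
        index[tuple(1 - b if i in D else b for i, b in enumerate(x))] for x in X
    ]


def construction_works(n, k):
    """Check, for every pair of points, that the induced permutation exchanges
    them and maps the class of k-monomials onto itself."""
    X, F = monomials(n, k)
    m = len(X)
    for a, xa in enumerate(X):
        for b, xb in enumerate(X):
            p = induced_point_permutation(X, xa, xb)
            exchanges = p[a] == b and p[b] == a
            preserves = {tuple(f[p[j]] for j in range(m)) for f in F} == F
            if not (exchanges and preserves):
                return False
    return True


assert construction_works(3, 2)
```

Unlike a search over all permutations of $X$, this check only needs one permutation per pair of points, so it scales to somewhat larger $n$.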

It is easy to extend the proof and show that the class of monomials of size
at most $k$ also leads to a weakly symmetric function $\mathcal{F}$.

6 Discussion 

As mentioned in the introduction, the combination of evaluation and learning
is a characteristic of property testing. A natural question is whether we can
design an exact (i.e. non-distributional) property testing protocol that is useful.
As we saw in Section G2, in the exact setting we are considering, whenever we
are able to positively test for membership, the learning problem will be hard,
and vice versa. So, in contrast to the commonly used property testing protocol
(which is defined with respect to some distribution over the instance space),
we can't expect two-sided property testers (that certify both membership and
non-membership) to be combined with exact learners successfully. It is still
possible, however, to combine learners and algorithms that certify
non-membership, with potential applications to agnostic exact learning.

Regarding other future directions, one natural thing to study is a general
upper bound on $aC^0$ that only depends on the size of the concept class. Moreover,
as mentioned in the text, the lower bounds for $aC^0$ and $aTD$ use different tools
to obtain a similar result, and these tools are often encountered in proofs for
lower bounds, so perhaps understanding their connections would be beneficial
in its own right.

Another interesting research direction is to study bounds for $C^0$ and $aC^0$
for particular concept classes. Several results exist ([7] and [8]) for $C^0$ but they
do not cover all natural concept classes. Another hope is that deriving upper
bounds for $C^0$ and $aC^0$ would in turn lead to a deeper understanding of the
gap between the worst case upper bound for $aTD$ and the upper bounds for
particular concept classes.

On another topic, as described in Section 5, interesting connections exist
between the membership query learning and deterministic decision tree
frameworks. One interesting direction would be to further investigate what other
function classes lead to weakly symmetric functions $\mathcal{F}$, as both positive and
negative answers would potentially help in revealing new connections between
learning and evaluation.